The aim of this study is to develop forecasting approaches for the sales data provided by Trendyol and to draw the necessary interpretations from it. First, the seasonality of each product is examined and the series are decomposed so that ARIMA models can be applied. Then possible regressors are checked and ARIMAX and SARIMAX models are fitted. The best model is chosen according to performance and test results, with interpretations given at each step.
Product names used in the work:
1) Yüz Temizleyici
2) Bebek Islak Mendili
3) Xiaomi Bluetooth Kulaklık
4) Fakir Süpürge
5) Tayt
6) Diş Fırçası
7) Mont
8) Bikini Model-1
9) Bikini Model-2
In this step, the required libraries are loaded, the necessary data tables are created and the other preliminary operations are executed.
library(data.table)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(mgcv)
## Loading required package: mgcv
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.8-28. For overview type 'help("mgcv-package")'.
library(fpp)
## Loading required package: forecast
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'forecast'
## The following object is masked from 'package:nlme':
##
## getResponse
## Loading required package: fma
## Loading required package: expsmooth
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: tseries
require(gratia)
## Loading required package: gratia
library(readxl)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(forecast)
library(zoo)
library(stringr)
library(stats)
library(urca)
g13<-read_excel("g13.xlsx")
dat<-as.data.table(g13) # Main Data Table
yuzt<-dat[1:372,] # Data table which holds the data of Product-1 (Yüz Temizleyici)
mendil<-dat[373:744,] # Data table which holds the data of Product-2 (Bebek Islak Mendili)
xiaomi<-dat[745:1116,] # Data table which holds the data of Product-3 (Xiaomi Bluetooth Kulaklık)
fakir<-dat[1117:1488,] # Data table which holds the data of Product-4 (Fakir Süpürge)
tayt<-dat[1489:1860,] # Data table which holds the data of Product-5 (Tayt)
bik1<-dat[1861:2232,] # Data table which holds the data of Product-6 (Bikini Model-1)
firca<-dat[2233:2604,] # Data table which holds the data of Product-7 (Diş Fırçası)
mont<-dat[2605:2976,] # Data table which holds the data of Product-8 (Mont)
bik2<-dat[2977:3348,] # Data table which holds the data of Product-9 (Bikini Model-2)
In this step, seasonality of each product is analysed to make appropriate decomposition.
The plot below shows the sold count of product-1 for related period.
As seen above, there is a pattern that fluctuates up on certain days and down on others. This directs us to check the autocorrelation function in order to determine the related seasonality.
Table below shows autocorrelation values at each lag until lag 100.
It would not be inappropriate to say that the autocorrelation function shows a pattern as well. Around lags 15 and 60 the autocorrelation values get higher, and the autocorrelation fades away at later lags. Normally the time series object for product-1's sold count ought to be built with frequency 15, which would also cover the lag-60 autocorrelation. However, as will be observed in the next step, choosing frequency 7 gives better decomposition results than frequency 15: other autocorrelation lags are covered better and the data becomes more stationary. So it would not be wrong to say there is weekly seasonality even if the autocorrelation function does not show it directly.
As seen below, the variance increases over time, so the best way of decomposing is the multiplicative method.
Required codes are below:
ts11=ts(yuzt$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")
The ACF of the decomposed series is really low and acceptable. Moreover, as the plot shows, the random (detrended) values are quite stationary, with stable variance and a nearly constant mean. The trend increases with fluctuations, which may cause some forecast error, but the model is still acceptable and good. Lastly, the seasonal component shows a decent, regular pattern.
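The component plots referred to above (trend, seasonal and random) can be drawn with base R's plot method for decomposed series. A minimal sketch on a synthetic weekly series; in the report the object would be `decomposed11`:

```r
# Synthetic stand-in for yuzt$sold_count, since the real data is not reproduced here
set.seed(1)
x <- ts(100 + (1:140) + 20 * sin(2 * pi * (1:140) / 7) + rnorm(140, sd = 5),
        frequency = 7)
dec <- decompose(x, type = "multiplicative")
plot(dec)   # one figure with observed, trend, seasonal and random panels
```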
Required codes are below:
ts12=ts(yuzt$sold_count, frequency = 15)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=15")
The reason the ACF is really high at lag 15 might be that in Turkey most people receive their salary on either the 1st or the 15th of the month, which would justify the high autocorrelation around lag 15. (Source) People probably tend to shop more after getting paid: even if they generally use credit cards, fresh income gives them the confidence to spend more. So another seasonality is observed at roughly every 15 days. However, as mentioned above, the ACF of the detrended series is worse now, and the plot of the detrended values below is worse than the one for frequency 7.
The detrended (random) values are less stationary in this case: their variance is not constant and their mean is less stable than before. The trend is wavier but increases more steadily. The seasonal component looks similar. Overall, this model is worse than the previous one because the detrended random values show more nonstationarity.
Required codes are below:
ts13=ts(yuzt$sold_count, frequency = 30)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=30")
The detrended values are worst in this model: they behave nonstationarily, with changing variance and mean, and their ACF is high rather than low as desired. The trend increases more than in the other two decompositions and fluctuates less. The random component is also the worst of the three models, and the seasonal component has a larger range, which distorts results more.
For the further parts, we will continue with the model of frequency = 7.
Based on the findings in the first part, ARIMA models are proposed and their performance is discussed.
As understood from the ACF and PACF plots, both tend to decay. Under the decaying-ACF reading, the PACF shows no significant spikes after lag 3, so an AR(3) model can be proposed. Under the decaying-PACF reading, the ACF shows no significant spikes after lag 3, so MA(3) is proposed. We can also try auto.arima and compare these three to select the best model. Additionally, there is no sign of seasonal ARIMA here.
AR(3) is created below:
model11=arima(decomposed11$random,order=c(3,0,0))
print(model11)
##
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 intercept
## 0.1893 -0.1191 -0.3100 0.9748
## s.e. 0.0496 0.0503 0.0497 0.0127
##
## sigma^2 estimated as 0.09049: log likelihood = -79.89, aic = 169.78
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black). They are slightly lagged because of the nature of AR models.
Residuals show a constant mean, but the variance varies at times. The ACF and PACF plots still show some signs of autocorrelation.
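The residual diagnostics described here can be produced with forecast::tsdisplay, which plots the residual series together with its ACF and PACF. A sketch on a synthetic fit standing in for model11:

```r
library(forecast)

set.seed(1)
fit <- arima(ts(rnorm(300)), order = c(3, 0, 0))   # stands in for model11
tsdisplay(residuals(fit), main = "AR(3) residual diagnostics")
```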
MA(3) is created below:
model12=arima(decomposed11$random,order=c(0,0,3))
print(model12)
##
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 3))
##
## Coefficients:
## ma1 ma2 ma3 intercept
## 0.0885 -0.2125 -0.3677 0.9748
## s.e. 0.0472 0.0503 0.0440 0.0080
##
## sigma^2 estimated as 0.08918: log likelihood = -77.27, aic = 164.54
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black) in this model too.
Residuals show a constant mean, but the variance varies more in this model. Even though the ACF and PACF values are better here, they are still high.
auto.arima is run below:
model13=auto.arima(decomposed11$random,seasonal=FALSE)
print(model13)
## Series: decomposed11$random
## ARIMA(3,0,2) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 mean
## 0.2382 0.2147 -0.3872 -0.1166 -0.4240 0.9748
## s.e. 0.2073 0.1875 0.0779 0.2199 0.1803 0.0076
##
## sigma^2 estimated as 0.08708: log likelihood=-69.97
## AIC=153.93 AICc=154.25 BIC=181.25
model13=arima(decomposed11$random,order=c(3,0,2)) # Manual re-fit of the auto.arima selection (same model)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")
Fitted values (blue) are again really close to real sales amount (black).
plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")
tsdisplay(fitted13,main="ARIMA(3,0,2)")
Residuals show a constant mean, but the variance varies at times. The ACF and PACF plots tell us this is the best model in terms of low ACF and PACF values.
AIC(model11)
## [1] 169.7801
BIC(model11)
## [1] 189.2933
AIC(model12)
## [1] 164.5391
BIC(model12)
## [1] 184.0523
AIC(model13)
## [1] 153.9338
BIC(model13)
## [1] 181.2523
As seen above, model13, the auto.arima fit ARIMA(3,0,2), is the best model, with the lowest AIC and BIC values. ARIMA(3,0,2) is selected for the further steps.
Test for fitted values:
accu=function(actual,forecast){ # Accuracy summary for a forecast against actual values
n=length(actual)
error=actual-forecast
mean=mean(actual)
sd=sd(actual)
CV=sd/mean # coefficient of variation
MAPE=sum(abs(error/actual))/n # mean absolute percentage error
MAD=sum(abs(error))/n # mean absolute deviation
MADP=sum(abs(error))/sum(abs(actual)) # MAD as a proportion of total actuals
WMAPE=MAD/mean # weighted MAPE
l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
return(l)
}
accu(ts11,tail(head(fitted_transformed11,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.2702096 17.51891 0.2361159 0.2361159
accu(ts11,tail(head(fitted_transformed12,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.2703676 17.21833 0.2320647 0.2320647
accu(ts11,tail(head(fitted_transformed13,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.2619928 16.86907 0.2273575 0.2273575
Again the third model, ARIMA(3,0,2), gives the best result, with lower errors on all error measures.
In this step, possible regressors are examined for use in an ARIMAX model.
Code below shows correlation between sold count and the data categories.
correlation=function(dt){ # Correlation between sold count (column 4) and the eight candidate regressor columns
dt=as.data.frame(dt)
a=data.frame(Variable=c("visit count","basket count","favored count",
                        "category sold","category visit","category basket",
                        "category favored","category brand sold"),
             Correlation=NA_real_)
for(i in 1:8) {
a[i,2]=cor(dt[,4],dt[,i+4])
}
return(a)
}
cor(yuzt$sold_count,yuzt$price)
## [1] -0.2299915
correlation(yuzt)
## Variable Correlation
## 1 visit count 0.3144539
## 2 basket count 0.8194234
## 3 favored count 0.4509207
## 4 category sold 0.6230038
## 5 category visit 0.1228725
## 6 category basket 0.2880863
## 7 category favored 0.6858087
## 8 category brand sold 0.3925289
There is clearly high correlation between sold count and basket count, category sold and category favored. This makes sense: people tend to buy what they have favored or added to their basket. Thanks to the function created above, candidate regressors are selected easily. A potential problem with these regressors, however, is that their future values are unknown, so they must be forecast as well in the final version of the model.
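Since future regressor values are unknown, one workable approach (a sketch, not part of the original pipeline) is to forecast the regressor with its own univariate model and feed the predicted values into the ARIMAX forecast through xreg:

```r
library(forecast)

set.seed(1)
y    <- ts(rnorm(200, mean = 10))   # stands in for the random component of sales
xreg <- ts(rnorm(200, mean = 50))   # stands in for yuzt$basket_count
fit  <- Arima(y, order = c(3, 0, 2), xreg = xreg)

# Step 1: forecast the regressor itself 7 days ahead
xreg_fc <- forecast(auto.arima(xreg), h = 7)$mean

# Step 2: forecast sales using the predicted regressor values
fc <- forecast(fit, h = 7, xreg = xreg_fc)
```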
In this step, the potential regressors are added to the model and ARIMAX models are created.
modelx1=arima(decomposed11$random,order=c(3,0,2),xreg=yuzt$basket_count)
print(modelx1)
##
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 2), xreg = yuzt$basket_count)
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 intercept yuzt$basket_count
## 0.1656 0.1448 -0.3539 2e-04 -0.2923 0.8806 3e-04
## s.e. 0.1401 0.1554 0.0039 NaN NaN NaN NaN
##
## sigma^2 estimated as 0.07997: log likelihood = -57.27, aic = 130.53
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend
modelx2=arima(decomposed11$random,order=c(3,0,2),xreg=yuzt$category_sold)
print(modelx2)
##
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 2), xreg = yuzt$category_sold)
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 intercept yuzt$category_sold
## -0.0960 -0.0457 -0.2361 0.3688 0.0609 0.8011 4e-04
## s.e. 0.1352 0.1309 NaN NaN NaN NaN NaN
##
## sigma^2 estimated as 0.07733: log likelihood = -51.05, aic = 118.1
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend
modelx3=arima(decomposed11$random,order=c(3,0,2),xreg=yuzt$category_favored)
print(modelx3)
##
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 2), xreg = yuzt$category_favored)
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 intercept
## 0.1691 0.1876 -0.3685 -0.0422 -0.3731 0.8898
## s.e. 0.1974 0.1762 0.0707 0.2061 0.1654 0.0137
## yuzt$category_favored
## 0
## s.e. NaN
##
## sigma^2 estimated as 0.08231: log likelihood = -62.61, aic = 141.22
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
Three models are constructed, one per regressor. Now they are tested on the last week.
accu(ts11,tail(head(fitted_transformed13,369),7)) # arima model
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.004031973 0.4117975 0.005550114 0.005550114
accu(ts11,tail(head(fitted_transformedx1,369),7)) # arimax1
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.004048837 0.3541774 0.004773523 0.004773523
accu(ts11,tail(head(fitted_transformedx2,369),7)) # arimax2
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.004150725 0.3935481 0.005304152 0.005304152
accu(ts11,tail(head(fitted_transformedx3,369),7)) # arimax3
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.004010801 0.4028655 0.005429729 0.005429729
As seen in the results, all models perform similarly, but arimax1 is slightly better in WMAPE and the other measures. Moreover, arimax1 (with basket count as regressor) performs better than the plain ARIMA model from the previous step.
After all the tests and operations above, the arimax1 model gives the best results and is selected as the final model.
In this step, seasonality of each product is analysed to make appropriate decomposition.
The plot below shows the sold count of product-2 for related period.
As seen above, there is a pattern that fluctuates up on certain days and down on others. This directs us to check the autocorrelation function in order to determine the related seasonality.
Table below shows autocorrelation values at each lag until lag 100.
It would not be inappropriate to say that the autocorrelation function shows a pattern as well. Around lags 15 and 60 the autocorrelation values get higher, and the autocorrelation fades away at later lags. Normally the time series object for product-2's sold count ought to be built with frequency 15, which would also cover the lag-60 autocorrelation. However, just like product-1, choosing frequency 7 gives better decomposition results than frequency 15: other autocorrelation lags are covered better and the data becomes more stationary. So it would not be wrong to say there is weekly seasonality even if the autocorrelation function does not show it directly.
As seen below, the variance increases over time, so the best way of decomposing is the multiplicative method.
Required codes are below:
ts11=ts(mendil$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")
The ACF of the decomposed series is really low and acceptable. Moreover, as the plot shows, the random (detrended) values are quite stationary, with stable variance and a nearly constant mean. The trend increases with fluctuations, which may cause some forecast error, but the model is still acceptable and good. Lastly, the seasonal component shows a decent, regular pattern.
Required codes are below:
ts12=ts(mendil$sold_count, frequency = 15)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=15")
The detrended (random) values are less stationary in this case: their variance is not constant and their mean is less stable than before. The trend is wavier but increases more steadily. The seasonal component looks similar. Overall, this model is worse than the previous one because the detrended random values show more nonstationarity.
Required codes are below:
ts13=ts(mendil$sold_count, frequency = 30)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=30")
The detrended values are worst in this model: they behave nonstationarily, with changing variance and mean, and their ACF is high rather than low as desired. The trend increases more than in the other two decompositions and fluctuates less. The random component is also the worst of the three models, and the seasonal component has a larger range, which distorts results more.
For the further parts, we will continue with the model of frequency = 7.
Based on the findings in the first part, ARIMA models are proposed and their performance is discussed.
As understood from the ACF and PACF plots, both tend to decay. Under the decaying-ACF reading, the PACF shows no significant spikes after lag 3, so an AR(3) model can be proposed. Under the decaying-PACF reading, the ACF shows no significant spikes after lag 3, so MA(3) is proposed. We can also try auto.arima and compare these three to select the best model. Additionally, there is no sign of seasonal ARIMA here.
AR(3) is created below:
model11=arima(decomposed11$random,order=c(3,0,0))
print(model11)
##
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 intercept
## 0.3581 -0.2436 -0.2027 0.9602
## s.e. 0.0511 0.0528 0.0510 0.0150
##
## sigma^2 estimated as 0.09778: log likelihood = -94.08, aic = 198.16
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black). They are slightly lagged because of the AR terms.
Residuals show a constant mean, but the variance varies at times. The ACF and PACF plots still show some signs of autocorrelation.
MA(3) is created below:
model12=arima(decomposed11$random,order=c(0,0,3))
print(model12)
##
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 3))
##
## Coefficients:
## ma1 ma2 ma3 intercept
## 0.2405 -0.2537 -0.3837 0.9598
## s.e. 0.0483 0.0581 0.0469 0.0099
##
## sigma^2 estimated as 0.09676: log likelihood = -92.23, aic = 194.46
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black) in this model too.
Residuals show a nearly constant mean, but the variance jumps at times in this model. Even though the ACF and PACF values are better here, they are still high.
auto.arima is run below:
model13=auto.arima(decomposed11$random,seasonal=FALSE)
print(model13)
## Series: decomposed11$random
## ARIMA(2,0,1) with non-zero mean
##
## Coefficients:
## ar1 ar2 ma1 mean
## 1.0134 -0.5088 -0.7209 0.9597
## s.e. 0.0616 0.0450 0.0593 0.0091
##
## sigma^2 estimated as 0.09503: log likelihood=-86.95
## AIC=183.9 AICc=184.07 BIC=203.41
model13=arima(decomposed11$random,order=c(2,0,1)) # Manual re-fit of the auto.arima selection (same model)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")
Fitted values (blue) are again really close to real sales amount (black).
plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")
tsdisplay(fitted13,main="ARIMA(2,0,1)")
Residuals show a constant mean, but the variance varies at times. The ACF and PACF plots tell us this is the best model in terms of low ACF and PACF values.
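The visual ACF/PACF check can be backed by a Ljung-Box test on the residuals, where a large p-value means no significant remaining autocorrelation. A sketch with a synthetic fit standing in for model13:

```r
set.seed(1)
fit <- arima(ts(rnorm(300)), order = c(2, 0, 1))   # stands in for model13
# H0: residuals are uncorrelated up to lag 10; fitdf subtracts the 3 ARMA coefficients
bt <- Box.test(residuals(fit), lag = 10, fitdf = 3, type = "Ljung-Box")
bt$p.value
```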
AIC(model11)
## [1] 198.1614
BIC(model11)
## [1] 217.6746
AIC(model12)
## [1] 194.4632
BIC(model12)
## [1] 213.9764
AIC(model13)
## [1] 183.899
BIC(model13)
## [1] 203.4122
As seen above, model13, the auto.arima fit ARIMA(2,0,1), is the best model, with the lowest AIC and BIC values. ARIMA(2,0,1) is selected for the further steps.
Test for fitted values:
accu=function(actual,forecast){
n=length(actual)
error=actual-forecast
mean=mean(actual)
sd=sd(actual)
CV=sd/mean
MAPE=sum(abs(error/actual))/n
MAD=sum(abs(error))/n
MADP=sum(abs(error))/sum(abs(actual))
WMAPE=MAD/mean
l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
return(l)
}
accu(ts11,tail(head(fitted_transformed11,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 385.1452 422.3167 1.096513 0.2991687 104.9772 0.2725654 0.2725654
accu(ts11,tail(head(fitted_transformed12,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 385.1452 422.3167 1.096513 0.3054549 102.8093 0.2669366 0.2669366
accu(ts11,tail(head(fitted_transformed13,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 385.1452 422.3167 1.096513 0.298858 102.0573 0.2649839 0.2649839
Again the third model, ARIMA(2,0,1), gives the best result in MAPE, MAD and MADP, and is nearly equal to the MA model in WMAPE; both beat the AR(3) model. By performance, the ARIMA model is better than the MA model, and the MA model is better than the AR model.
In this step, possible regressors are examined for use in an ARIMAX model.
Code below shows correlation between sold count and the data categories.
correlation=function(dt){ # Function to compute correlation between sold counts and other columns in the data
dt=as.data.frame(dt)
a=data.frame(data= NA, nrow=8)
colnames(a)=c("Variable","Correlation")
a[1,1]="visit count"
a[2,1]="basket count"
a[3,1]="favored count"
a[4,1]="category sold"
a[5,1]="category visit"
a[6,1]="category basket"
a[7,1]="category favored"
a[8,1]="category brand sold"
for(i in 1:8) {
a[i,2]=cor(dt[,4],dt[,i+4])
}
return (a)
}
cor(mendil$sold_count,mendil$price)
## [1] -0.5721201
correlation(mendil)
## Variable Correlation
## 1 visit count 0.3002766
## 2 basket count 0.8872024
## 3 favored count 0.2832415
## 4 category sold 0.9157251
## 5 category visit 0.4285587
## 6 category basket 0.1962214
## 7 category favored 0.7963313
## 8 category brand sold 0.1320124
There is clearly high correlation between sold count and basket count, category sold and category favored. This makes sense: people tend to buy what they have favored or added to their basket. Thanks to the function created above, candidate regressors are selected easily. A potential problem with these regressors, however, is that their future values are unknown, so they must be forecast as well in the final version of the model.
In this step, the potential regressors are added to the model and ARIMAX models are created.
modelx1=arima(decomposed11$random,order=c(2,0,1),xreg=mendil$basket_count)
print(modelx1)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = mendil$basket_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept mendil$basket_count
## 0.5269 0.0123 0.1696 0.6284 3e-04
## s.e. 0.3535 0.2216 0.3458 0.0303 NaN
##
## sigma^2 estimated as 0.06762: log likelihood = -26.61, aic = 65.22
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend
modelx2=arima(decomposed11$random,order=c(2,0,1),xreg=mendil$category_sold)
print(modelx2)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = mendil$category_sold)
##
## Coefficients:
## ar1 ar2 ma1 intercept mendil$category_sold
## 0.4713 -0.2106 0.0706 0.7133 1e-04
## s.e. 0.1489 0.0604 0.1319 0.0302 NaN
##
## sigma^2 estimated as 0.07551: log likelihood = -46.71, aic = 105.43
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend
modelx3=arima(decomposed11$random,order=c(2,0,1),xreg=mendil$category_favored)
print(modelx3)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = mendil$category_favored)
##
## Coefficients:
## ar1 ar2 ma1 intercept mendil$category_favored
## 0.6261 -0.3234 -0.1738 0.7596 0
## s.e. 0.1341 0.0466 0.1242 0.0306 NaN
##
## sigma^2 estimated as 0.08683: log likelihood = -72.27, aic = 156.53
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
AIC(model13)
## [1] 183.899
AIC(modelx1)
## [1] 65.22218
AIC(modelx2)
## [1] 105.4285
AIC(modelx3)
## [1] 156.5311
For an initial comparison, the ARIMAX model with basket count as regressor has the best AIC value.
Three models are constructed, one per regressor. Now they are tested on the last week.
accu(ts11,tail(head(fitted_transformed13,369),7)) # arima model
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 385.1452 422.3167 1.096513 0.001917651 1.233079 0.003201597 0.003201597
accu(ts11,tail(head(fitted_transformedx1,369),7)) # arimax1
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 385.1452 422.3167 1.096513 0.002547509 1.521034 0.003949249 0.003949249
accu(ts11,tail(head(fitted_transformedx2,369),7)) # arimax2
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 385.1452 422.3167 1.096513 0.001909252 1.182135 0.003069323 0.003069323
accu(ts11,tail(head(fitted_transformedx3,369),7)) # arimax3
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 385.1452 422.3167 1.096513 0.002345371 1.412103 0.003666417 0.003666417
As seen in the results, all models perform similarly, but arimax1 is slightly better in WMAPE and the other measures. Moreover, arimax1 (with basket count as regressor) performs better than the plain ARIMA model from the previous step.
After all the tests and operations above, the arimax1 model gives the best results and is selected as the final model.
In this step, seasonality of each product is analysed to make appropriate decomposition.
The plot below shows the sold count of product-3 for related period.
As seen above, there is a pattern that fluctuates up on certain days and down on others. This directs us to check the autocorrelation function in order to determine the related seasonality.
Table below shows autocorrelation values at each lag until lag 100.
It would not be inappropriate to say that the autocorrelation function shows a pattern as well. Around lags 7 and 30 the autocorrelation values get higher, and the autocorrelation fades away at later lags. Normally the time series object for product-3's sold count ought to be built with frequency 7 or 30, but we had better determine the seasonality after the decomposition stage. There is weekly and perhaps monthly seasonality for this product.
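As a cross-check on the candidate periods, forecast::findfrequency() estimates the dominant seasonal period from the spectral density of the series. A sketch on a synthetic series with a weekly cycle:

```r
library(forecast)

set.seed(1)
x <- 100 + 20 * sin(2 * pi * (1:365) / 7) + rnorm(365, sd = 3)
findfrequency(x)   # estimated seasonal period; 7 for this weekly cycle
```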
As seen below, the variance increases over time, so the best way of decomposing is the multiplicative method.
Required codes are below:
ts11=ts(xiaomi$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")
The ACF of the decomposed series is really low and acceptable. Moreover, as the plot shows, the random (detrended) values are quite stationary, with stable variance (apart from some jumps) and a nearly constant mean. The trend gradually rises and falls with fluctuations, which may cause some forecast error, but the model is still acceptable and good. Lastly, the seasonal component shows a decent, regular pattern.
Required codes are below:
ts13=ts(xiaomi$sold_count, frequency = 30)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=30")
The detrended values are worse in this model: they behave nonstationarily, with changing variance and mean. The trend component shows no clear tendency to increase or decrease. The ACF of the detrended values is not that bad, but the random component is worse than in the weekly decomposition. The seasonal component has a larger deviation, which may make results fluctuate more.
For the further parts, we will continue with the model of frequency = 7.
Based on the findings in the first part, ARIMA models are proposed and their performance is discussed.
As understood from the ACF and PACF plots, both tend to decay. Under the decaying-ACF reading, the PACF shows no significant spikes after lag 3, so an AR(3) model can be proposed. Under the decaying-PACF reading, the ACF shows no significant spikes after lag 3, so MA(3) is proposed. We can also try auto.arima and compare these three to select the best model. Additionally, there is no sign of seasonal ARIMA here.
AR(3) is created below:
model11=arima(decomposed11$random,order=c(3,0,0))
print(model11)
##
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 intercept
## 0.1838 -0.1684 -0.3338 0.9851
## s.e. 0.0492 0.0493 0.0491 0.0094
##
## sigma^2 estimated as 0.05641: log likelihood = 6.56, aic = -3.11
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black). They are slightly lagged because of the AR terms.
Residuals show a constant mean, but the variance increases over time.
MA(3) is created below:
model12=arima(decomposed11$random,order=c(0,0,3))
print(model12)
##
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 3))
##
## Coefficients:
## ma1 ma2 ma3 intercept
## 0.0280 -0.2659 -0.4802 0.9848
## s.e. 0.0423 0.0519 0.0475 0.0034
##
## sigma^2 estimated as 0.05211: log likelihood = 20.81, aic = -31.62
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")
Fitted values (blue) are really close to real sales amount (black) in this model too.
Residuals are really similar to those of the first model. The AIC value is better in this model.
Auto Arima is created below:
model13=auto.arima(decomposed11$random,seasonal = FALSE)
print(model13)
## Series: decomposed11$random
## ARIMA(5,0,3) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## 0.2611 -0.6062 0.1784 -0.1918 -0.2221 -0.1929 0.3040 -0.5984
## s.e. 0.1403 0.1237 0.1092 0.0840 0.0689 0.1381 0.1086 0.0989
## mean
## 0.9851
## s.e. 0.0038
##
## sigma^2 estimated as 0.0501: log likelihood=32.35
## AIC=-44.69 AICc=-44.07 BIC=-5.67
model13=arima(decomposed11$random,order=c(5,0,3)) # Manual refit of the auto.arima model (same model)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")
Fitted values (blue) are again really close to real sales amount (black).
plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")
tsdisplay(fitted13,main="ARIMA(5,0,3)")
Residuals show a constant mean, but the variance differs at times. The ACF and PACF plots tell us this is the best model in terms of lower ACF and PACF values.
AIC(model11)
## [1] -3.110113
BIC(model11)
## [1] 16.40305
AIC(model12)
## [1] -31.62326
BIC(model12)
## [1] -12.1101
AIC(model13)
## [1] -44.69223
BIC(model13)
## [1] -5.665892
As seen above, model13, the auto.arima fit ARIMA(5,0,3), is the best in AIC. However, the MA model beats it on BIC because BIC punishes extra parameters more heavily.
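The heavier parameter punishment of BIC can be made concrete. Below is a minimal, illustrative sketch (on simulated data, not the Trendyol series) that recomputes both criteria from the log-likelihood and checks them against R's built-ins:

```r
# Illustrative only: AIC = -2*logLik + 2k and BIC = -2*logLik + k*log(n),
# where k counts the estimated coefficients plus sigma^2. BIC's log(n)
# factor punishes extra parameters more than AIC's constant 2 once n > 7.
set.seed(1)
x <- arima.sim(model = list(ar = c(0.2, -0.15, -0.3)), n = 372)
fit <- arima(x, order = c(3, 0, 0))
k <- length(fit$coef) + 1              # ar1, ar2, ar3, intercept + sigma^2
n <- length(x)
manual_aic <- -2 * fit$loglik + 2 * k
manual_bic <- -2 * fit$loglik + k * log(n)
all.equal(manual_aic, AIC(fit))        # TRUE
all.equal(manual_bic, BIC(fit))        # TRUE
```

This is why an extra AR or MA term can improve AIC while worsening BIC, exactly as observed for model12 versus model13.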
Test for fitted values:
accu=function(actual,forecast){
n=length(actual)
error=actual-forecast
mean=mean(actual)
sd=sd(actual)
CV=sd/mean
MAPE=sum(abs(error/actual))/n
MAD=sum(abs(error))/n
MADP=sum(abs(error))/sum(abs(actual))
WMAPE=MAD/mean
l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
return(l)
}
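A tiny call (hand-picked toy numbers, not the product data) shows how the metrics behave:

```r
# With actual = (10, 20) and forecast = (12, 18), the errors are -2 and +2,
# so MAD = 2, MAPE = (0.2 + 0.1)/2 = 0.15, and WMAPE = MADP = 4/30.
accu(c(10, 20), c(12, 18))
```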
accu(ts11,tail(head(fitted_transformed11,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 392.0323 218.106 0.5563472 0.1860551 66.12454 0.1686712 0.1686712
accu(ts11,tail(head(fitted_transformed12,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 392.0323 218.106 0.5563472 0.1814687 63.31901 0.1615148 0.1615148
accu(ts11,tail(head(fitted_transformed13,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 392.0323 218.106 0.5563472 0.1754918 61.90513 0.1579083 0.1579083
ARIMA(5,0,3) gives the best result on all test parameters. According to these performances, the ARIMA model is better than the MA model, and the MA model is better than the AR model. ARIMA(5,0,3) should be selected for the further steps.
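The same recomposition used for the fitted values above also applies to genuine out-of-sample forecasts. A minimal sketch on a simulated weekly series (not the Trendyol data; carrying the last trend forward and the weekly phase alignment of the seasonal factors are simplifying assumptions):

```r
# Forecast the random component, then multiply back by the seasonal factors
# and the last observed trend value to return to the original scale.
set.seed(7)
y <- ts(100 + 10 * rep(c(2, 1, 1, 1, 1, 1, 3), 20) + rnorm(140, sd = 3),
        frequency = 7)
dec <- decompose(y, type = "multiplicative")
fit <- arima(dec$random, order = c(3, 0, 0))   # arima() tolerates the NA ends
h <- 7
fc_random <- predict(fit, n.ahead = h)$pred
last_trend <- tail(na.omit(dec$trend), 1)      # naive: carry trend forward
next_season <- dec$seasonal[1:h]               # assumes matching weekly phase
fc <- as.numeric(fc_random) * as.numeric(next_season) * as.numeric(last_trend)
length(fc)                                     # 7 forecast values
```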
In this step, possible regressors will be examined for use in an ARIMAX model.
Code below shows correlation between sold count and the data categories.
correlation=function(dt){ # Function to compute correlation between sold counts and other columns in the data
dt=as.data.frame(dt)
a=data.frame(Variable=rep(NA_character_,8),Correlation=rep(NA_real_,8))
a[1,1]="visit count"
a[2,1]="basket count"
a[3,1]="favored count"
a[4,1]="category sold"
a[5,1]="category visit"
a[6,1]="category basket"
a[7,1]="category favored"
a[8,1]="category brand sold"
for(i in 1:8) {
a[i,2]=cor(dt[,4],dt[,i+4])
}
return (a)
}
cor(xiaomi$sold_count,xiaomi$price)
## [1] -0.5127488
correlation(xiaomi)
## Variable Correlation
## 1 visit count 0.21153610
## 2 basket count 0.86567764
## 3 favored count 0.22808036
## 4 category sold 0.53232294
## 5 category visit 0.01179411
## 6 category basket 0.06057578
## 7 category favored 0.27347833
## 8 category brand sold 0.06973421
Clearly there is high correlation between sold count and basket count, category sold, and price. This result makes sense: people tend to buy what they have favored or added to their basket, and higher prices make people buy less. Thanks to the function created above, possible regressors are selected easily. However, a potential problem with these regressors is that their future values are unknown, so they would have to be forecast as well in the final version of the model.
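One hedged way around the missing future regressor values is to forecast the regressor with its own model first. A minimal sketch on simulated stand-ins (not the actual xiaomi series; the orders here are illustrative, not tuned):

```r
# Fit an ARIMAX-style model, forecast the regressor separately, then pass
# the regressor forecasts through newxreg when predicting sales.
set.seed(42)
price <- ts(200 + cumsum(rnorm(372)))           # random-walk stand-in
random_part <- ts(1 + 0.4 * scale(price)[, 1] + rnorm(372, sd = 0.2))
fit_y <- arima(random_part, order = c(1, 0, 0), xreg = price)
fit_x <- arima(price, order = c(0, 1, 1))       # model for the regressor itself
future_price <- predict(fit_x, n.ahead = 7)$pred
fc <- predict(fit_y, n.ahead = 7, newxreg = future_price)$pred
length(fc)                                      # 7
```

The extra uncertainty from forecasting the regressor propagates into the sales forecast, which is the practical cost of this approach.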
In this step, potential regressors will be added to the model and ARIMAX models will be created.
modelx1=arima(decomposed11$random,order=c(5,0,3),xreg=xiaomi$basket_count)
print(modelx1)
##
## Call:
## arima(x = decomposed11$random, order = c(5, 0, 3), xreg = xiaomi$basket_count)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## 0.2035 -0.7361 -0.0094 -0.1607 -0.3037 -0.0914 0.4821 -0.3312
## s.e. 0.1565 0.1304 0.1136 0.0761 0.0642 0.1576 0.1160 0.0926
## intercept xiaomi$basket_count
## 0.8694 1e-04
## s.e. 0.0091 NaN
##
## sigma^2 estimated as 0.04347: log likelihood = 53.98, aic = -85.97
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend
modelx2=arima(decomposed11$random,order=c(5,0,3),xreg=xiaomi$category_sold)
print(modelx2)
##
## Call:
## arima(x = decomposed11$random, order = c(5, 0, 3), xreg = xiaomi$category_sold)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## -0.0373 0.2239 0.5257 -0.1746 0.2020 0.6214 0.0901 -0.5859
## s.e. 0.1237 0.0956 0.0976 0.0958 0.0579 0.1326 0.1572 0.1344
## intercept xiaomi$category_sold
## 0.4151 7e-04
## s.e. 0.0697 1e-04
##
## sigma^2 estimated as 0.03922: log likelihood = 72.69, aic = -123.39
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend
modelx3=arima(decomposed11$random,order=c(5,0,3),xreg=xiaomi$price)
print(modelx3)
##
## Call:
## arima(x = decomposed11$random, order = c(5, 0, 3), xreg = xiaomi$price)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## 0.2516 -0.6212 0.1537 -0.1925 -0.2367 -0.1888 0.3213 -0.5803
## s.e. 0.1470 0.1232 0.1084 0.0851 0.0686 0.1467 0.1095 0.1029
## intercept xiaomi$price
## 1.1520 -0.0012
## s.e. 0.0557 0.0004
##
## sigma^2 estimated as 0.04731: log likelihood = 38.31, aic = -54.61
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
AIC(model13)
## [1] -44.69223
AIC(modelx1)
## [1] -85.96773
AIC(modelx2)
## [1] -123.3877
AIC(modelx3)
## [1] -54.6138
For an initial comparison, although all AIC values are negative and quite low, the ARIMAX model with category sold as regressor has the best AIC value. However, the test results may differ.
Three models are constructed, one for each of the three regressors. Now they will be tested on the last week.
accu(ts11,tail(head(fitted_transformed13,369),7)) # arima model
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 392.0323 218.106 0.5563472 0.003007114 1.334838 0.003404918 0.003404918
accu(ts11,tail(head(fitted_transformedx1,369),7)) # arimax1
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 392.0323 218.106 0.5563472 0.002735178 1.240497 0.003164274 0.003164274
accu(ts11,tail(head(fitted_transformedx2,369),7)) # arimax2
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 392.0323 218.106 0.5563472 0.003638369 1.522088 0.003882558 0.003882558
accu(ts11,tail(head(fitted_transformedx3,369),7)) # arimax3
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 392.0323 218.106 0.5563472 0.002616265 1.19986 0.003060615 0.003060615
To interpret the test, model arimax3 (price as regressor) gives the lowest error values. It is the best on all parameters such as WMAPE, even though its AIC value is relatively higher.
After all the tests and operations above, the arimax3 model (price as regressor) gives the best results, and it is selected as the final model.
In this step, seasonality of each product is analysed to make appropriate decomposition.
The plot below shows the sold count of product-4 for related period.
As seen above, there is a pattern which fluctuates up on certain days and down on others. That directs us to check the autocorrelation in order to determine the related seasonality.
Table below shows autocorrelation values at each lag until lag 100.
The autocorrelation function shows a pattern as well: the values peak around lags 7 and 14 and fade away at later lags. Still, a frequency of 7 should be chosen for product-4, because choosing lag 7 covers lag 14 as well. So there is weekly seasonality.
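The lag-7 versus lag-14 comparison can also be made numerically instead of by eye. A small sketch on a simulated weekly series (the real check would use fakir's sold counts in the same way):

```r
# acf(..., plot = FALSE) returns the raw autocorrelation values, so the
# candidate seasonal lags can be compared directly.
set.seed(4)
x <- 50 + 15 * rep(c(2, 1, 1, 1, 1, 1, 3), length.out = 372) + rnorm(372, sd = 5)
a <- acf(x, lag.max = 100, plot = FALSE)
round(a$acf[c(8, 15)], 3)   # values at lags 7 and 14 (index 1 is lag 0)
# Both are high, but a frequency of 7 already covers the lag-14 peak.
```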
As seen below, the variance increases over time, so the best way of decomposing is the multiplicative method.
Required codes are below:
ts11=ts(fakir$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")
The ACF of the decomposed series is quite low and acceptable. Moreover, as seen in the graph, the random (detrended) values are fairly stationary, with stable variance (except for some jumps) and a nearly constant mean. The trend does not tell us anything important, but the model is still acceptable and good. Lastly, the seasonal component shows a decent pattern with an acceptable shape.
Required codes are below:
ts12=ts(fakir$sold_count, frequency = 14)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=14")
plot(decomposed12)
Autocorrelation seems higher than in the previous model. The reason might be that frequency 14 does not cover the weekly pattern, whereas choosing lag 7 covers lag 14 as well. So frequency = 14 is worse than frequency = 7.
The detrended (random) values are less stationary in this case: their variance is not constant and their mean is less stable than before. The trend in this series is wavier, though it tends to decrease more. Lastly, the seasonal component looks similar. Overall this model is worse than the previous one because the detrended random values show more nonstationarity.
Required codes are below:
ts13=ts(fakir$sold_count, frequency = 30)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=30")
Detrended values are worse in this model: they show nonstationary behavior due to changing variance and mean. The trend component gives no clue of an increase or decrease, as it first decreases and then rises in the middle. The ACF of the detrended values is not very good either, and the random values are worse than in the weekly model because of the changing variance and barely constant mean. The seasonal component has more terms because of the larger frequency.
For the further parts, we will continue with the model of frequency = 7.
Based on the findings in the first part, ARIMA models will be proposed and their performance will be discussed.
As the ACF and PACF plots show, both tend to decay. Since the PACF plot shows no significant spikes after lag 3, an AR(3) model can be proposed; since the ACF plot shows no significant spikes after lag 3 either, an MA(3) model is proposed as well. We can also try auto.arima and compare these three to select the best model. Additionally, it should be mentioned that there is no sign of seasonal ARIMA here.
AR(3) is created below:
model11=arima(decomposed11$random,order=c(3,0,0))
print(model11)
##
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 intercept
## 0.1085 -0.0945 -0.3236 0.9801
## s.e. 0.0494 0.0495 0.0500 0.0108
##
## sigma^2 estimated as 0.07311: log likelihood = -40.83, aic = 91.67
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black). They are slightly lagged because of the AR terms.
Residuals show constant mean but variance increases.
MA(3) is created below:
model12=arima(decomposed11$random,order=c(0,0,3))
print(model12)
##
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 3))
##
## Coefficients:
## ma1 ma2 ma3 intercept
## -0.0202 -0.2480 -0.3644 0.9800
## s.e. 0.0478 0.0487 0.0430 0.0051
##
## sigma^2 estimated as 0.07012: log likelihood = -33.33, aic = 76.65
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")
Fitted values (blue) are really close to real sales amount (black) in this model too.
Residuals are closer to 0 in this model than in the first one. The AIC value is better in this model.
Auto Arima is created below:
model13=auto.arima(decomposed11$random,seasonal = FALSE)
print(model13)
## Series: decomposed11$random
## ARIMA(4,0,1) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 mean
## 0.7425 -0.1878 -0.2388 0.0305 -0.7569 0.9799
## s.e. 0.0810 0.0639 0.0643 0.0638 0.0613 0.0051
##
## sigma^2 estimated as 0.06838: log likelihood=-25.8
## AIC=65.61 AICc=65.92 BIC=92.93
model13=arima(decomposed11$random,order=c(4,0,1)) # Manual refit of the auto.arima model (same model)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")
Fitted values (blue) are again really close to real sales amount (black).
plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")
tsdisplay(fitted13,main="ARIMA(4,0,1)")
Residuals show a constant mean, but the variance differs at times. The ACF and PACF plots tell us this is the best model in terms of lower ACF and PACF values.
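The residual readings above can also be checked formally with a Ljung-Box test rather than by eye. A minimal sketch (shown on a simulated fit; applying it to model13's residuals works the same way):

```r
# The null hypothesis is that the residuals are uncorrelated up to the chosen
# lag, so a large p-value supports the "no remaining autocorrelation" reading.
# fitdf subtracts the number of fitted ARMA parameters from the test's df.
set.seed(3)
x <- arima.sim(model = list(ar = 0.5), n = 300)
fit <- arima(x, order = c(1, 0, 0))
Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 1)
```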
AIC(model11)
## [1] 91.6659
BIC(model11)
## [1] 111.1791
AIC(model12)
## [1] 76.65494
BIC(model12)
## [1] 96.1681
AIC(model13)
## [1] 65.60766
BIC(model13)
## [1] 92.9261
As seen above, model13, the auto.arima fit ARIMA(4,0,1), is the best in AIC. The second best model is MA(3).
Test for fitted values:
accu=function(actual,forecast){
n=length(actual)
error=actual-forecast
mean=mean(actual)
sd=sd(actual)
CV=sd/mean
MAPE=sum(abs(error/actual))/n
MAD=sum(abs(error))/n
MADP=sum(abs(error))/sum(abs(actual))
WMAPE=MAD/mean
l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
return(l)
}
accu(ts11,tail(head(fitted_transformed11,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 39.72581 39.17959 0.9862504 0.2396726 9.1253 0.2297071 0.2297071
accu(ts11,tail(head(fitted_transformed12,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 39.72581 39.17959 0.9862504 0.2376557 8.985335 0.2261838 0.2261838
accu(ts11,tail(head(fitted_transformed13,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 39.72581 39.17959 0.9862504 0.2304129 8.615689 0.2168789 0.2168789
ARIMA(4,0,1) gives the best result on all test parameters. According to these performances, the ARIMA model is better than the MA model, and the MA model is better than the AR model. ARIMA(4,0,1) should be selected for the further steps.
In this step, possible regressors will be examined for use in an ARIMAX model.
Code below shows correlation between sold count and the data categories.
correlation=function(dt){ # Function to compute correlation between sold counts and other columns in the data
dt=as.data.frame(dt)
a=data.frame(Variable=rep(NA_character_,8),Correlation=rep(NA_real_,8))
a[1,1]="visit count"
a[2,1]="basket count"
a[3,1]="favored count"
a[4,1]="category sold"
a[5,1]="category visit"
a[6,1]="category basket"
a[7,1]="category favored"
a[8,1]="category brand sold"
for(i in 1:8) {
a[i,2]=cor(dt[,4],dt[,i+4])
}
return (a)
}
cor(fakir$sold_count,fakir$price)
## [1] -0.32637
correlation(fakir)
## Variable Correlation
## 1 visit count -0.102187158
## 2 basket count 0.866866490
## 3 favored count -0.145043073
## 4 category sold 0.759162610
## 5 category visit 0.004137553
## 6 category basket -0.122458472
## 7 category favored 0.567747022
## 8 category brand sold -0.170524082
Clearly there is high correlation between sold count and basket count, category sold, and category favored. This result makes sense: people tend to buy what they have favored or added to their basket. Thanks to the function created above, possible regressors are selected easily. However, a potential problem with these regressors is that their future values are unknown, so they would have to be forecast as well in the final version of the model.
In this step, potential regressors will be added to the model and ARIMAX models will be created.
modelx1=arima(decomposed11$random,order=c(4,0,1),xreg=fakir$basket_count)
print(modelx1)
##
## Call:
## arima(x = decomposed11$random, order = c(4, 0, 1), xreg = fakir$basket_count)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 intercept fakir$basket_count
## 0.6858 -0.1792 -0.2419 0.0176 -0.6888 0.9641 1e-04
## s.e. 0.1246 0.0635 0.0628 0.0689 0.1272 0.0196 2e-04
##
## sigma^2 estimated as 0.06706: log likelihood = -25.18, aic = 66.37
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend
modelx2=arima(decomposed11$random,order=c(4,0,1),xreg=fakir$category_sold)
print(modelx2)
##
## Call:
## arima(x = decomposed11$random, order = c(4, 0, 1), xreg = fakir$category_sold)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 intercept fakir$category_sold
## 0.5079 -0.1532 -0.2452 0.0022 -0.4742 0.9095 4e-04
## s.e. 0.2094 0.0633 0.0596 0.0809 0.2270 0.0285 2e-04
##
## sigma^2 estimated as 0.06431: log likelihood = -17.39, aic = 50.78
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend
modelx3=arima(decomposed11$random,order=c(4,0,1),xreg=fakir$category_favored)
print(modelx3)
##
## Call:
## arima(x = decomposed11$random, order = c(4, 0, 1), xreg = fakir$category_favored)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 intercept
## 0.7029 -0.1821 -0.2399 0.0218 -0.7198 0.9571
## s.e. 0.0803 0.0625 0.0632 0.0655 0.0555 0.0069
## fakir$category_favored
## 0
## s.e. NaN
##
## sigma^2 estimated as 0.06646: log likelihood = -23.58, aic = 63.17
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
AIC(model13)
## [1] 65.60766
AIC(modelx1)
## [1] 66.36656
AIC(modelx2)
## [1] 50.77947
AIC(modelx3)
## [1] 63.16558
For an initial comparison, model arimax2 (category sold as regressor) has the lowest AIC value. This might be a sign for selection.
Three models are constructed, one for each of the three regressors. Now they will be tested on the last week.
accu(ts11,tail(head(fitted_transformed13,369),7)) # arima model
## n mean sd CV MAPE MAD MADP
## 1 372 39.72581 39.17959 0.9862504 0.005127835 0.05148335 0.001295967
## WMAPE
## 1 0.001295967
accu(ts11,tail(head(fitted_transformedx1,369),7)) # arimax1
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 39.72581 39.17959 0.9862504 0.00489123 0.04979956 0.001253582 0.001253582
accu(ts11,tail(head(fitted_transformedx2,369),7)) # arimax2
## n mean sd CV MAPE MAD MADP
## 1 372 39.72581 39.17959 0.9862504 0.004614496 0.04775374 0.001202083
## WMAPE
## 1 0.001202083
accu(ts11,tail(head(fitted_transformedx3,369),7)) # arimax3
## n mean sd CV MAPE MAD MADP
## 1 372 39.72581 39.17959 0.9862504 0.005139033 0.05145274 0.001295197
## WMAPE
## 1 0.001295197
To interpret the test, all four models give very close results with high performance, so any of them could reasonably be chosen. Even though arimax2 has the lowest AIC value, the test results are close, and to reduce the risk of forecast mistakes the simplest model should be chosen: the plain ARIMA(4,0,1) model without regressors.
After all the tests and operations above, the ARIMA(4,0,1) model is selected as the final model due to its simplicity.
In this step, seasonality of each product is analysed to make appropriate decomposition.
The plot below shows the sold count of product-5 for related period.
As seen above, there is a pattern which fluctuates up on certain days and down on others. That directs us to check the autocorrelation in order to determine the related seasonality.
Table below shows autocorrelation values at each lag until lag 100.
The autocorrelation function shows a pattern as well: the values peak around lags 14 and 15 and fade away at later lags. A frequency of 7 will also be tried in the decomposition part, since it has given the best results so far for the other products. Finally, it would not be wrong to say that a similar pattern repeats every 14 days.
As seen below, the variance increases over time, so the best way of decomposing is the multiplicative method.
Required codes are below:
ts11=ts(tayt$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")
The ACF of the decomposed series is quite low and acceptable. Moreover, as seen in the graph, the random (detrended) values can be counted as stationary, with stable variance and an approximately constant mean. The trend, being wavy, does not tell us anything important, but the model might still be acceptable and good. Lastly, the seasonal component shows a decent pattern with an acceptable shape.
Required codes are below:
ts13=ts(tayt$sold_count, frequency = 14)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=14")
plot(decomposed13)
Autocorrelation seems a little higher than in the previous model.
The detrended (random) values are less stationary in this case: their variance is not as constant and their mean is less stable than before. The trend in this series is not linear, which makes forecasting harder. Lastly, the seasonal component looks similar. Overall this model is worse than the previous one because the detrended random values show signs of nonstationarity.
Required codes are below:
ts12=ts(tayt$sold_count, frequency = 15)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=15")
plot(decomposed12)
Detrended values are worse in this model: they show nonstationary behavior due to changing variance and mean. The trend component is again not linear; it fluctuates. The ACF of the detrended values is not bad; nevertheless, the random values are worse than in the weekly model because of the changing variance and barely constant mean. The seasonal component has more terms because of the larger frequency.
For the further parts, even though none of the options gives very satisfactory results, we will continue with the model of frequency = 7 due to its slightly better detrended values (slightly more stationary, with a more constant mean and variance).
Based on the findings in the first part, ARIMA models will be proposed and their performance will be discussed.
As the ACF and PACF plots show, both tend to decay. Since the PACF plot shows no significant spikes after lag 2, an AR(2) model can be proposed (the negative autocorrelation at lag 2 is another sign for AR(2)). Since the ACF plot shows no significant spikes after lag 5, an MA(5) model is proposed as well. We can also try auto.arima and compare these three to select the best model. Additionally, it should be mentioned that there is no sign of seasonal ARIMA here.
AR(2) is created below:
model11=arima(decomposed11$random,order=c(2,0,0))
print(model11)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 0))
##
## Coefficients:
## ar1 ar2 intercept
## 0.5551 -0.3861 0.9639
## s.e. 0.0488 0.0492 0.0182
##
## sigma^2 estimated as 0.08323: log likelihood = -64.61, aic = 137.22
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black). They are slightly lagged because of the AR terms.
Residuals show a constant mean, but the variance changes a lot. The ACF and PACF still show some autocorrelation.
MA(5) is created below:
model12=arima(decomposed11$random,order=c(0,0,5))
print(model12)
##
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 5))
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5 intercept
## 0.4285 -0.1421 -0.3891 -0.2371 -0.0751 0.9629
## s.e. 0.0522 0.0578 0.0520 0.0532 0.0526 0.0086
##
## sigma^2 estimated as 0.07724: log likelihood = -51.05, aic = 116.09
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")
Fitted values (blue) are really close to real sales amount (black) in this model too.
Residuals look the same as in the AR(2) model, but from the summaries above, the AIC value is better (lower) in this model. The ACF and PACF plots show that the autocorrelation is handled better here.
Auto Arima is created below:
model13=auto.arima(decomposed11$random,seasonal = FALSE)
print(model13)
## Series: decomposed11$random
## ARIMA(2,0,1) with non-zero mean
##
## Coefficients:
## ar1 ar2 ma1 mean
## 1.0850 -0.6042 -0.6853 0.9633
## s.e. 0.0534 0.0426 0.0515 0.0087
##
## sigma^2 estimated as 0.07533: log likelihood=-44.56
## AIC=99.11 AICc=99.28 BIC=118.63
model13=arima(decomposed11$random,order=c(2,0,1)) # Manual refit of the auto.arima model (same model)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")
Fitted values (blue) are again really close to real sales amount (black).
plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")
tsdisplay(fitted13,main="ARIMA(2,0,1)")
Residuals are still similar to the previous two models. The ACF and PACF plots tell us this is the best model in terms of lower ACF and PACF values.
AIC(model11)
## [1] 137.224
BIC(model11)
## [1] 152.8346
AIC(model12)
## [1] 116.0926
BIC(model12)
## [1] 143.411
AIC(model13)
## [1] 99.11418
BIC(model13)
## [1] 118.6273
As seen above, model13, the auto.arima fit ARIMA(2,0,1), is the best in both AIC and BIC. The second best model is MA(5).
Test for fitted values:
accu=function(actual,forecast){
n=length(actual)
error=actual-forecast
mean=mean(actual)
sd=sd(actual)
CV=sd/mean
MAPE=sum(abs(error/actual))/n
MAD=sum(abs(error))/n
MADP=sum(abs(error))/sum(abs(actual))
WMAPE=MAD/mean
l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
return(l)
}
accu(ts11,tail(head(fitted_transformed11,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 858.207 1099.162 1.280766 0.2689123 233.0372 0.2715396 0.2715396
accu(ts11,tail(head(fitted_transformed12,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 858.207 1099.162 1.280766 0.2695938 222.8493 0.2596685 0.2596685
accu(ts11,tail(head(fitted_transformed13,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 858.207 1099.162 1.280766 0.2641632 217.3164 0.2532214 0.2532214
The test errors are quite small and close to each other, but if we must choose one, ARIMA(2,0,1) gives slightly better results on all test parameters. According to these performances, the ARIMA model is better than the MA model, and the MA model is better than the AR model. ARIMA(2,0,1) should be selected for the further steps.
In this step, possible regressors will be examined for use in an ARIMAX model.
Code below shows correlation between sold count and the data categories.
correlation=function(dt){ # Function to compute correlation between sold counts and other columns in the data
dt=as.data.frame(dt)
a=data.frame(Variable=rep(NA_character_,8),Correlation=rep(NA_real_,8))
a[1,1]="visit count"
a[2,1]="basket count"
a[3,1]="favored count"
a[4,1]="category sold"
a[5,1]="category visit"
a[6,1]="category basket"
a[7,1]="category favored"
a[8,1]="category brand sold"
for(i in 1:8) {
a[i,2]=cor(dt[,4],dt[,i+4])
}
return (a)
}
cor(tayt$sold_count,tayt$price)
## [1] -0.2583435
correlation(tayt)
## Variable Correlation
## 1 visit count 0.05290733
## 2 basket count 0.83720969
## 3 favored count 0.16874292
## 4 category sold 0.90004658
## 5 category visit 0.05368907
## 6 category basket -0.08685947
## 7 category favored 0.57715502
## 8 category brand sold -0.03864685
Clearly there is high correlation between sold count and basket count, category sold, and category favored. This result makes sense: people tend to buy what they have favored or added to their basket. Thanks to the function created above, possible regressors are selected easily. However, a potential problem with these regressors is that their future values are unknown, so they would have to be forecast as well in the final version of the model.
In this step, potential regressors will be added to the model and ARIMAX models will be created.
modelx1=arima(decomposed11$random,order=c(2,0,1),xreg=tayt$basket_count)
print(modelx1)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = tayt$basket_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept tayt$basket_count
## 0.9270 -0.4999 -0.4183 0.8575 0
## s.e. 0.0778 0.0390 0.0688 0.0199 NaN
##
## sigma^2 estimated as 0.07117: log likelihood = -35.99, aic = 83.99
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend
modelx2=arima(decomposed11$random,order=c(2,0,1),xreg=tayt$category_sold)
print(modelx2)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = tayt$category_sold)
##
## Coefficients:
## ar1 ar2 ma1 intercept tayt$category_sold
## 0.6093 -0.0576 0.2404 0.6511 2e-04
## s.e. 0.1538 0.1151 0.1450 0.0398 NaN
##
## sigma^2 estimated as 0.05661: log likelihood = 5.79, aic = 0.42
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend
modelx3=arima(decomposed11$random,order=c(2,0,1),xreg=tayt$category_favored)
print(modelx3)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = tayt$category_favored)
##
## Coefficients:
## ar1 ar2 ma1 intercept tayt$category_favored
## 1.0271 -0.5787 -0.6007 0.8578 0
## s.e. 0.0526 0.0405 0.0372 0.0211 NaN
##
## sigma^2 estimated as 0.07161: log likelihood = -37.22, aic = 86.44
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
AIC(model13)
## [1] 99.11418
AIC(modelx1)
## [1] 83.98627
AIC(modelx2)
## [1] 0.4222833
AIC(modelx3)
## [1] 86.43665
For an initial comparison, model arimax2 (category sold as regressor) has by far the lowest AIC value. If the test results for the last week do not change the picture, arimax2 seems the best option.
Three models are constructed, one for each of the three regressors. Now they will be tested on the last week.
accu(ts11,tail(head(fitted_transformed13,369),7)) # arima model
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 858.207 1099.162 1.280766 0.001727326 0.4430522 0.0005162533 0.0005162533
accu(ts11,tail(head(fitted_transformedx1,369),7)) # arimax1
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 858.207 1099.162 1.280766 0.0019466 0.5126242 0.00059732 0.00059732
accu(ts11,tail(head(fitted_transformedx2,369),7)) # arimax2
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 858.207 1099.162 1.280766 0.002237589 0.5641885 0.0006574038 0.0006574038
accu(ts11,tail(head(fitted_transformedx3,369),7)) # arimax3
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 858.207 1099.162 1.280766 0.001601462 0.4117527 0.0004797825 0.0004797825
To interpret the test: all four models give very close results with high performance. Model arimax3 (with category favored as regressor) actually gives the lowest WMAPE, MADP and MAD values, but the differences between the models are negligible, so the choice can be made on AIC instead. Arimax2 (with category sold as regressor) has a remarkably low AIC, close to 0, while the others are around 80. So the model is upgraded to an ARIMAX with category sold as the regressor.
After all the tests and operations above, the arimax2 model (with category sold as regressor) is selected as the final model due to its outstandingly low AIC value.
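Once a final ARIMAX model is fixed, producing actual forecasts requires future values of the regressor. A minimal, hedged sketch on simulated data (not the Trendyol series; the naive "hold the regressor at its last value" assumption is ours) using base R's `predict()` with `newxreg`:

```r
# Sketch only: simulated stand-ins for the random component and the regressor
set.seed(1)
n    <- 100
xreg <- cumsum(rnorm(n))                              # toy regressor (e.g. category_sold)
y    <- 0.5 * xreg + arima.sim(list(ar = 0.6), n = n) # toy target driven by the regressor

fit <- arima(y, order = c(1, 0, 0), xreg = xreg)

# predict() needs the regressor's future values; here we naively assume it
# stays at its last observed level for the next 7 days
new_xreg <- rep(tail(xreg, 1), 7)
fc <- predict(fit, n.ahead = 7, newxreg = new_xreg)
fc$pred  # 7 point forecasts; fc$se holds the standard errors
```

In the report's setting the point forecasts would still have to be multiplied back by the seasonal and trend components, as is done for the fitted values above.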
In this step, seasonality of each product is analysed to make appropriate decomposition.
The plot below shows the sold count of product-6 (Diş Fırçası) for the related period.
As seen above, there is a pattern that fluctuates up on certain days and down on others. This directs us to check the autocorrelation in order to determine the related seasonality.
Table below shows autocorrelation values at each lag until lag 100.
In this case, autocorrelation is high at every lag. Frequencies of 7, 14 and 30 will be examined in the decomposition part and the seasonality will be selected accordingly. As will appear in the next part, frequency = 7 gives the most stationary result and the lowest ACF values after decomposition, so it would not be wrong to say that there is approximately weekly seasonality.
As seen below, the variance is increasing, so the best way to decompose is the multiplicative method.
Required codes are below:
ts11=ts(firca$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")
The ACF of the decomposed series is quite low and acceptable except for a few lags. As seen in the graph, the random (detrended) values are not quite stationary, with decreasing variance and a changing mean. The trend component tends to increase. The model may still be selected because the other options are even worse. Lastly, the seasonal component shows a decent pattern with an acceptable shape.
Required codes are below:
ts13=ts(firca$sold_count, frequency = 14)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=14")
plot(decomposed13)
Autocorrelation seems slightly higher than in the previous case, but still better than before decomposition.
The detrended (random) values are even less stationary in this case. They have jumps in variance, which also decreases, though their mean is more constant than before. The trend component is inclined to go up, and the seasonal component looks similar. However, this model is worse than the previous one because the detrended random values show more signs of nonstationarity than the frequency = 7 case.
Required codes are below:
ts12=ts(firca$sold_count, frequency = 30)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=30")
plot(decomposed12)
The detrended values are the worst in this model. They show clearly nonstationary behavior due to changing variance and mean. The trend component is again inclined to increase. The ACF of the detrended values is not too bad; nevertheless, the random values are worse than in the weekly case because of the changing variance and non-constant mean. The seasonal component has more terms per period because of the larger frequency.
For the further parts, even though none of the three is entirely satisfactory, frequency = 7 is acceptable and the best one. So we will continue with the frequency = 7 model due to its slightly better detrended values (slightly more stationary, with a more constant mean and variance).
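The frequency comparison carried out above can also be scripted. A hedged sketch on a simulated weekly-seasonal series (the scoring rule, the sum of squared residual autocorrelations, is our assumption, not the report's):

```r
# Decompose a toy multiplicative series at each candidate frequency and score
# the random component by how far it is from white noise (lower = better)
set.seed(7)
n      <- 364
trend  <- 100 + 0.2 * (1:n)
season <- rep(c(0.8, 0.9, 1.0, 1.1, 1.3, 1.2, 0.7), length.out = n)  # weekly pattern
sales  <- trend * season * exp(rnorm(n, sd = 0.05))

score <- sapply(c(7, 14, 30), function(f) {
  d <- decompose(ts(sales, frequency = f), type = "multiplicative")
  r <- na.omit(as.numeric(d$random))                 # random component, NAs trimmed
  sum(acf(r, lag.max = 30, plot = FALSE)$acf[-1]^2)  # drop lag 0
})
names(score) <- c("freq7", "freq14", "freq30")
score  # freq = 30 leaves the most residual autocorrelation for this toy series
```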
Based on the findings in the first part, ARIMA models will be proposed and their performance will be discussed.
As understood from the ACF and PACF plots, both tend to decrease. For the decreasing ACF part, the PACF plot does not show significant spikes after lag 3, so an AR(3) model can be proposed (the negative autocorrelation at early lags is another sign of AR terms). For the decreasing PACF part, the ACF plot does not show significant spikes after lag 5, so MA(5) is proposed. We can also try auto.arima and compare these three to select the best model. Additionally, it should be mentioned that there is no sign of a seasonal ARIMA here.
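The cut-off reasoning used here can be illustrated on data whose true order is known. A hedged sketch (simulated AR(2), not the product data): the sample PACF should be large at lags 1-2 and drop inside the significance band afterwards.

```r
# Simulate an AR(2) process and inspect its sample PACF
set.seed(2)
y <- arima.sim(list(ar = c(0.5, 0.3)), n = 300)
p <- pacf(y, plot = FALSE)$acf
round(p[1:4], 2)  # large at lags 1-2, small from lag 3 on
```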
AR(3) is created below:
model11=arima(decomposed11$random,order=c(3,0,0))
print(model11)
##
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 intercept
## 0.3530 -0.1543 -0.1970 0.9439
## s.e. 0.0512 0.0538 0.0511 0.0212
##
## sigma^2 estimated as 0.1637: log likelihood = -188.37, aic = 386.75
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black). They are slightly lagged because of the AR terms.
Residuals show a constant mean, but the variance changes at some points. The ACF and PACF plots still show some autocorrelation.
MA(5) is created below:
model12=arima(decomposed11$random,order=c(0,0,5))
print(model12)
##
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 5))
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5 intercept
## 0.3168 -0.1039 -0.2983 -0.1865 -0.0843 0.9434
## s.e. 0.0525 0.0522 0.0518 0.0482 0.0520 0.0135
##
## sigma^2 estimated as 0.1589: log likelihood = -182.91, aic = 379.82
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black) in this model too.
Residuals look similar to the AR(3) model, with slightly better variance. The AIC value in the summaries above is lower for this model, and the ACF and PACF plots show that autocorrelation is handled better here.
Auto Arima is created below:
model13=auto.arima(decomposed11$random,seasonal = FALSE)
print(model13)
## Series: decomposed11$random
## ARIMA(2,0,1) with non-zero mean
##
## Coefficients:
## ar1 ar2 ma1 mean
## 1.0468 -0.4621 -0.7305 0.9436
## s.e. 0.0645 0.0462 0.0580 0.0136
##
## sigma^2 estimated as 0.1598: log likelihood=-182
## AIC=374.01 AICc=374.17 BIC=393.52
model13=arima(decomposed11$random,order=c(2,0,1)) # Manual refit of the auto.arima choice (same model)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")
Fitted values (blue) are again really close to real sales amount (black).
plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")
tsdisplay(fitted13,main="ARIMA(2,0,1)")
Residuals are still similar to the previous two models. The ACF and PACF plots tell us this is the best model in terms of lower ACF and PACF values. auto.arima may sometimes fail to find the best model, but in this case it worked very well.
AIC(model11)
## [1] 386.7499
BIC(model11)
## [1] 406.2631
AIC(model12)
## [1] 379.822
BIC(model12)
## [1] 407.1404
AIC(model13)
## [1] 374.0077
BIC(model13)
## [1] 393.5209
As seen above, model13, the auto.arima choice ARIMA(2,0,1), is the best in both AIC and BIC. The second-best model is MA(5).
Test for fitted values:
accu=function(actual,forecast){
  n=length(actual)
  error=actual-forecast
  mean=mean(actual)
  sd=sd(actual)
  CV=sd/mean
  MAPE=sum(abs(error/actual))/n
  MAD=sum(abs(error))/n
  MADP=sum(abs(error))/sum(abs(actual))
  WMAPE=MAD/mean
  l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
  return(l)
}
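A quick hand-checkable example of the metrics accu() computes (the function is restated so this chunk runs on its own; the toy numbers are ours):

```r
accu=function(actual,forecast){  # same definition as above
  n=length(actual)
  error=actual-forecast
  mean=mean(actual)
  sd=sd(actual)
  CV=sd/mean
  MAPE=sum(abs(error/actual))/n
  MAD=sum(abs(error))/n
  MADP=sum(abs(error))/sum(abs(actual))
  WMAPE=MAD/mean
  data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
}
res=accu(c(10,20,30,40),c(9,21,33,36))
res$MAD    # (1+1+3+4)/4 = 2.25
res$WMAPE  # 2.25/25 = 0.09
```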
accu(ts11,tail(head(fitted_transformed11,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 92.20968 98.29599 1.066005 Inf 20.02766 0.217197 0.217197
accu(ts11,tail(head(fitted_transformed12,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 92.20968 98.29599 1.066005 Inf 19.9664 0.2165326 0.2165326
accu(ts11,tail(head(fitted_transformed13,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 92.20968 98.29599 1.066005 Inf 19.68207 0.2134491 0.2134491
The test results show that all three models are close to each other, but the ARIMA(2,0,1) model is slightly better, with lower error values. So we continue with ARIMA(2,0,1).
In this step, possible regressors will be examined to be used in the ARIMAX model.
Code below shows correlation between sold count and the data categories.
correlation=function(dt){ # Function to compute correlation between sold counts and other columns in the data
  dt=as.data.frame(dt)
  a=data.frame(data= NA, nrow=8)
  colnames(a)=c("Variable","Correlation")
  a[1,1]="visit count"
  a[2,1]="basket count"
  a[3,1]="favored count"
  a[4,1]="category sold"
  a[5,1]="category visit"
  a[6,1]="category basket"
  a[7,1]="category favored"
  a[8,1]="category brand sold"
  for(i in 1:8) {
    a[i,2]=cor(dt[,4],dt[,i+4])
  }
  return (a)
}
cor(firca$sold_count,firca$price)
## [1] NA
correlation(firca)
## Variable Correlation
## 1 visit count 0.8409331
## 2 basket count 0.9515283
## 3 favored count 0.7896215
## 4 category sold 0.3998297
## 5 category visit 0.1229495
## 6 category basket 0.5771166
## 7 category favored 0.4229052
## 8 category brand sold 0.4911847
Obviously there is high correlation between the sold count and the visit count, basket count and favored count. The price data is dirty for this product, so the best choice is not to use price as a regressor. However, a potential problem with these regressors is that their future values are unknown; they would have to be forecast as well in the final version of the model.
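The future-regressor problem mentioned above is usually handled in two stages: forecast the regressor with its own model, then feed those forecasts into `newxreg`. A hedged sketch on simulated data (the AR(1) model for the regressor is our assumption, not the report's):

```r
set.seed(4)
n    <- 150
xreg <- as.numeric(arima.sim(list(ar = 0.8), n = n)) + 10  # toy basket_count stand-in
y    <- 0.3 * xreg + as.numeric(arima.sim(list(ar = 0.5), n = n))

# Stage 1: model the regressor itself and forecast its next 7 values
fit_x   <- arima(xreg, order = c(1, 0, 0))
xreg_fc <- predict(fit_x, n.ahead = 7)$pred

# Stage 2: use the forecast regressor path in the ARIMAX prediction
fit_y <- arima(y, order = c(1, 0, 0), xreg = xreg)
y_fc  <- predict(fit_y, n.ahead = 7, newxreg = xreg_fc)$pred
y_fc  # 7-day-ahead forecasts; note the regressor's own uncertainty is ignored
```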
In this step, the potential regressors will be used in the model and ARIMAX models will be created.
modelx1=arima(decomposed11$random,order=c(2,0,1),xreg=firca$visit_count)
print(modelx1)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = firca$visit_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept firca$visit_count
## 1.0622 -0.4665 -0.7616 0.9207 0
## s.e. 0.0279 0.0344 NaN NaN NaN
##
## sigma^2 estimated as 0.1545: log likelihood = -177.89, aic = 367.79
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend
modelx2=arima(decomposed11$random,order=c(2,0,1),xreg=firca$basket_count)
print(modelx2)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = firca$basket_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept firca$basket_count
## 1.0356 -0.4542 -0.7256 0.8750 2e-04
## s.e. 0.0703 0.0485 0.0756 0.0232 1e-04
##
## sigma^2 estimated as 0.1478: log likelihood = -169.67, aic = 351.34
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend
modelx3=arima(decomposed11$random,order=c(2,0,1),xreg=firca$favored_count)
print(modelx3)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = firca$favored_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept firca$favored_count
## 1.0578 -0.4650 -0.7599 0.9097 1e-04
## s.e. 0.0615 0.0464 0.0542 0.0158 1e-04
##
## sigma^2 estimated as 0.1531: log likelihood = -176.22, aic = 364.44
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
AIC(model13)
## [1] 374.0077
AIC(modelx1)
## [1] 367.7873
AIC(modelx2)
## [1] 351.3435
AIC(modelx3)
## [1] 364.4441
For an initial comparison, model arimax2 (with basket count, the most correlated regressor at 0.95) has the lowest AIC value. If the test results for the last week do not change this picture, arimax2 seems to be the best option to select.
Three ARIMAX models have been constructed, one for each of the three candidate regressors. They will now be tested on the last week of data.
accu(ts11,tail(head(fitted_transformed13,369),7)) # arima model
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 92.20968 98.29599 1.066005 0.004081113 0.725805 0.007871245 0.007871245
accu(ts11,tail(head(fitted_transformedx1,369),7)) # arimax1
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 92.20968 98.29599 1.066005 0.004215994 0.6889001 0.007471018 0.007471018
accu(ts11,tail(head(fitted_transformedx2,369),7)) # arimax2
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 92.20968 98.29599 1.066005 0.004182911 0.6607526 0.007165762 0.007165762
accu(ts11,tail(head(fitted_transformedx3,369),7)) # arimax3
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 92.20968 98.29599 1.066005 0.004102785 0.6787281 0.007360704 0.007360704
To interpret the test, model arimax2 (with basket count as regressor) gives the lowest WMAPE, MADP and MAD values. So the model is enhanced with the new regressor.
After all the tests and operations above, the arimax2 model (with basket count as regressor) is selected as the final model.
ts.plot(mont$sold_count)
It is crystal clear that the number of sales for this product is 0 most of the time in our data. Considering that this product is mainly used in a particularly small interval of the year, it makes sense that sales counts greater than 0 only appear on dates near the winter months. Now, we will plot the basket_count data for this product.
ts.plot(mont$basket_count)
Again, there are a lot of 0 values in this data.
acf(mont$sold_count, lag=100 ,main="Mont")
We will take a closer look at this data by decreasing the lag parameter.
acf(mont$sold_count, lag=30 ,main="Mont")
We see that the frequencies of the peak points are not the same (there is no obvious pattern), and there is no remarkable autocorrelation at important lags like 7, 15 or 30. Therefore, no conclusion about seasonality can be drawn.
ARIMA models are built on time series data and make predictions using previous values, errors, etc. However, our data for the mont product does not have a sufficient number of nonzero sold_count values. That is why we cannot build and test ARIMA models for this product; unfortunately, the later steps will not be taken.
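The intermittency argument can be quantified simply as the share of zero-sales days. A hedged sketch on a toy series standing in for mont$sold_count (the simulated series is ours):

```r
# Simulate an intermittent daily sales series: most days have no sales
set.seed(3)
sold <- rbinom(372, size = 1, prob = 0.1) * rpois(372, lambda = 5)

zero_share <- mean(sold == 0)
zero_share  # share of days with zero sales; a high share is what rules out ARIMA here
```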
Product names used in the work:
1)Yüz Temizleyici
2)Bebek Islak Mendili
3)Xiaomi Bluetooth Kulaklık
4)Fakir Süpürge
5)Tayt
6)Bikini Model-1
7)Diş Fırçası
8)Mont
9)Bikini Model-2
In this step, required libraries are called, necessary data tables are created and the other very beginning operations are executed.
library(data.table)
library(ggplot2)
library(dplyr)
require(mgcv)
library(fpp)
require(gratia)
library(readxl)
library(lubridate)
library(forecast)
library(zoo)
library(stringr)
library(stats)
library(urca)
g13<-read_excel("g13.xlsx")
dat<-as.data.table(g13) # Main Data Table
yuzt<-dat[1:372,] # Data table which holds the data of Product-1 (Yüz Temizleyici)
mendil<-dat[373:744,] # Data table which holds the data of Product-2 (Bebek Islak Mendili)
xiaomi<-dat[745:1116,] # Data table which holds the data of Product-3 (Xiaomi Bluetooth Kulaklık)
fakir<-dat[1117:1488,] # Data table which holds the data of Product-4 (Fakir Süpürge)
tayt<-dat[1489:1860,] # Data table which holds the data of Product-5 (Tayt)
bik1<-dat[1861:2232,] # Data table which holds the data of Product-6 (Bikini Model-1)
bik_1<-dat[2132:2232,] # Data table without the null values of Product-6
firca<-dat[2233:2604,] # Data table which holds the data of Product-7 (Diş Fırçası)
mont<-dat[2605:2976,] # Data table which holds the data of Product-8 (Mont)
bik2<-dat[2977:3348,] # Data table which holds the data of Product-9 (Bikini Model-2)
In this step, seasonality of each product is analyzed to make appropriate decomposition.
The plot below shows the sold count of Bikini Model-1 for the related period.
As seen above, there is a pattern that fluctuates up on certain days and down on others, and the variance does not seem constant. This directs us to check the autocorrelation in order to determine the related seasonality.
Table below shows autocorrelation values at each lag until lag 100.
It would not be inappropriate to say that the autocorrelation function shows a pattern as well. Around lags 10 and 44 the autocorrelation values get higher, and the autocorrelation fades away at later lags. Normally, the time series object of the sold count for this product ought to be given a frequency of 10, which would cover the other autocorrelations too. However, as will be observed in the next step, choosing lag 7 gives better decomposition results than lag 10, and at that point some of the other autocorrelated lags are covered better while obtaining more stationary data. So it would not be wrong to say that there is weekly seasonality, even if the autocorrelation function does not show it directly.
As seen below, the variance decreases and then increases, so the best way to decompose is the multiplicative method.
Required codes are below:
ts11=ts(bik_1$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")
The ACF values of the decomposed series do not exceed the significance levels. Moreover, as seen in the graph, the random (detrended) values are quite stationary, with stable variance and a nearly constant mean. The trend seems to increase, with fluctuations in the later parts of the series that may cause errors in the forecast. But the model is still acceptable and good. Lastly, the seasonal component shows a decent pattern with an acceptable shape.
Required codes are below:
ts12=ts(bik_1$sold_count, frequency = 15)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=15")
The reason why the ACF values are higher at frequency 15 might be that in Turkey most people get their salary on either day 1 or day 15 of the month. (Source) People probably tend to shop more after getting their salary; even if people generally use credit cards, receiving fresh money gives them the confidence to spend more. This could have produced low ACF values at frequency 15, but this product did not give the values we wanted: as mentioned above, the ACF of the detrended series is worse now. Moreover, the graph below shows the plot of the detrended values, which is worse than the one at frequency = 7.
The detrended (random) values are less stationary in this case. They do not have constant variance, and their mean is less constant than before. The trend in this series does not fluctuate much, but it moves up and down through time. The seasonal component looks similar. However, this model is worse than the previous one because the detrended random values show more nonstationarity.
Required codes are below:
ts13=ts(bik_1$sold_count, frequency = 30)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=30")
The detrended values are the worst in this model. They show nonstationary behavior due to changing variance and a non-constant mean. The trend component increases more than in the other two cases and fluctuates less. The first two ACF values of the detrended series are high, which is not what we want, and the random values are also the worst of the three models. The seasonal component has a larger range, which affects the results more.
For the further parts, we will continue with the model of frequency = 7.
Based on the findings in the first part, ARIMA models will be proposed and their performance will be discussed.
As understood from the ACF and PACF plots, both show high correlations in some parts. For the decreasing ACF part, the PACF plot does not show significant spikes after lag 3, so an AR(3) model can be proposed. For the decreasing PACF part, the ACF plot does not show significant spikes after lag 2, so MA(2) is proposed. We can also try auto.arima and compare these three to select the best model; the third model, obtained with auto.arima, will probably be the best one to build on. Additionally, it should be mentioned that there is no sign of a seasonal ARIMA here.
AR(3) is created below:
model11=arima(decomposed11$random,order=c(3,0,0))
print(model11)
##
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 intercept
## 0.2455 -0.3144 -0.1884 0.9900
## s.e. 0.1012 0.0982 0.1003 0.0232
##
## sigma^2 estimated as 0.07927: log likelihood = -14.62, aic = 39.24
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")
Fitted values (blue) are close to the real sales amounts (black). However, this is a simple model and it does not fit a high percentage of the data points. The fit is slightly lagged because of the nature of AR models.
Residuals show a constant mean, but the variance differs at times. The ACF and PACF plots still show some autocorrelation above the significance levels.
MA(2) is created below:
model12=arima(decomposed11$random,order=c(0,0,2))
print(model12)
##
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 2))
##
## Coefficients:
## ma1 ma2 intercept
## -0.2064 -0.6247 0.9907
## s.e. 0.0739 0.0703 0.0055
##
## sigma^2 estimated as 0.08212: log likelihood = -16.74, aic = 41.49
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black) in this model too. The fit is lagged due to the nature of MA models.
Residuals show a constant mean, but the variance seems higher at first and lower in the later parts. Even though the ACF and PACF values are better in this model, they are still high.
Auto Arima is created below:
model13=auto.arima(decomposed11$random,seasonal=FALSE)
print(model13)
## Series: decomposed11$random
## ARIMA(2,0,1) with non-zero mean
##
## Coefficients:
## ar1 ar2 ma1 mean
## 0.9430 -0.4970 -0.9058 0.9906
## s.e. 0.0924 0.0895 0.0542 0.0051
##
## sigma^2 estimated as 0.06996: log likelihood=-7.22
## AIC=24.44 AICc=25.11 BIC=37.21
model13=arima(decomposed11$random,order=c(2,0,1)) # Manual refit of the auto.arima choice (same model)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")
Fitted values (blue) are again really close to the real sales amounts (black). The model fits better when the variance is low; outlier points lower the quality of the fit.
plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")
tsdisplay(fitted13,main="ARIMA(2,0,1)")
Residuals show a constant mean, and although the variance still differs at times, it is better than in the two previous models. The ACF and PACF plots tell us this is the best model in terms of lower ACF and PACF values.
AIC(model11)
## [1] 39.24273
BIC(model11)
## [1] 52.01212
AIC(model12)
## [1] 41.48683
BIC(model12)
## [1] 51.70234
AIC(model13)
## [1] 24.44036
BIC(model13)
## [1] 37.20975
As seen above, model13, the auto.arima choice ARIMA(2,0,1), is the best model with the lowest values. ARIMA(2,0,1) should be selected for the further steps.
Test for fitted values:
accu=function(actual,forecast){
  n=length(actual)
  error=actual-forecast
  mean=mean(actual)
  sd=sd(actual)
  CV=sd/mean
  MAPE=sum(abs(error/actual))/n
  MAD=sum(abs(error))/n
  MADP=sum(abs(error))/sum(abs(actual))
  WMAPE=MAD/mean
  l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
  return(l)
}
accu(ts11,tail(head(fitted_transformed11,98),95))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.2446731 6.699741 0.2042481 0.2042481
accu(ts11,tail(head(fitted_transformed12,98),95))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.2386745 6.370985 0.1942256 0.1942256
accu(ts11,tail(head(fitted_transformed13,98),95))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.2253659 6.077611 0.1852818 0.1852818
Again, the third result, ARIMA(2,0,1), gives the best performance with lower MAPE, MAD, MADP and WMAPE errors. The remaining statistics are equal in all three models because they depend only on the actual data.
In this step, possible regressors will be examined to be used in the ARIMAX model.
Code below shows correlation between sold count and the data categories.
correlation=function(dt){ # Function to compute correlation between sold counts and other columns in the data
  dt=as.data.frame(dt)
  a=data.frame(data= NA, nrow=8)
  colnames(a)=c("Variable","Correlation")
  a[1,1]="visit count"
  a[2,1]="basket count"
  a[3,1]="favored count"
  a[4,1]="category sold"
  a[5,1]="category visit"
  a[6,1]="category basket"
  a[7,1]="category favored"
  a[8,1]="category brand sold"
  for(i in 1:8) {
    a[i,2]=cor(dt[,4],dt[,i+4])
  }
  return (a)
}
correlation(bik_1)
## Variable Correlation
## 1 visit count 0.7099228
## 2 basket count 0.7960412
## 3 favored count 0.4806033
## 4 category sold 0.6833365
## 5 category visit 0.2349762
## 6 category basket 0.4955038
## 7 category favored 0.4896690
## 8 category brand sold 0.5047960
cor(bik_1$sold_count,bik_1$price)
## [1] -0.05910234
Obviously there is high correlation between the sold count and the basket count, visit count and category sold. This result makes sense because people tend to buy what they visit more, and when a category attracts more attention, sales of all of its products increase as well. Thanks to the function created above, possible regressors are selected easily. However, a potential problem with these regressors is that their future values are unknown; they would have to be forecast as well in the final version of the model.
In this step, the potential regressors will be used in the model and ARIMAX models will be created.
modelx1=arima(decomposed11$random,order=c(2,0,1),xreg=bik_1$basket_count)
print(modelx1)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = bik_1$basket_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept bik_1$basket_count
## 0.771 -0.4691 -0.6327 0.7984 0.001
## s.e. NaN 0.0375 NaN NaN NaN
##
## sigma^2 estimated as 0.06242: log likelihood = -3.35, aic = 18.69
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend
modelx2=arima(decomposed11$random,order=c(2,0,1),xreg=bik_1$visit_count)
print(modelx2)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = bik_1$visit_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept bik_1$visit_count
## 0.8744 -0.505 -0.8680 0.9139 0
## s.e. 0.0942 0.089 0.0636 0.0005 NaN
##
## sigma^2 estimated as 0.06188: log likelihood = -3.36, aic = 18.71
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend
modelx3=arima(decomposed11$random,order=c(2,0,1),xreg=bik_1$category_sold)
print(modelx3)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = bik_1$category_sold)
##
## Coefficients:
## ar1 ar2 ma1 intercept bik_1$category_sold
## 0.9439 -0.4990 -1.0000 0.9718 0
## s.e. 0.0880 0.0875 0.0332 0.0001 NaN
##
## sigma^2 estimated as 0.06025: log likelihood = -3.59, aic = 19.18
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
Three ARIMAX models have been constructed, one for each of the three candidate regressors. They will now be tested on the last week of data.
accu(ts11,tail(head(fitted_transformed13,98),7)) # arima model
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.007614923 0.305363 0.009309287 0.009309287
accu(ts11,tail(head(fitted_transformedx1,98),7)) # arimax1
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.007010133 0.2776748 0.008465184 0.008465184
accu(ts11,tail(head(fitted_transformedx2,98),7)) # arimax2
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.007210265 0.277234 0.008451745 0.008451745
accu(ts11,tail(head(fitted_transformedx3,98),7)) # arimax3
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.00780154 0.2981519 0.009089447 0.009089447
As seen in the results, all models perform similarly, but arimax1 is slightly better in terms of WMAPE and the other parameters. Moreover, arimax1 (with basket count as regressor) performs better than the plain ARIMA model from the previous step.
After all the tests and operations above, the arimax1 model gives the best results and is selected as the final model.
In this step, seasonality of each product is analyzed to make appropriate decomposition.
The plot below shows the sold count of product-9 (Bikini Model-2) for the related period.
As seen above, there is a pattern that fluctuates up and down, and the variance seems to increase through time. This directs us to check the autocorrelation in order to determine the related seasonality.
Table below shows autocorrelation values at each lag until lag 100.
The ACF at the first lags decreases slowly, which reflects an increasing trend in the data. Since most of our data was deleted, there are only 2 significant ACF values, at lag 1 and lag 2. We cannot speak with certainty because of the dirty nature of the data, but it would not be incorrect to say that autocorrelation increases a little at lag 7, which suggests weekly seasonality.
As seen below, the variance decreases and then increases, so the best way to decompose is the multiplicative method.
Required codes are below:
ts11=ts(bik_2$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")
The ACF values of the decomposed series do not exceed the significance levels. Moreover, as seen in the graph, the random (detrended) values are quite stationary, with stable variance and a nearly constant mean. The trend seems to increase with fluctuations, which may cause errors in the forecast. But the model is still acceptable and good. Lastly, the seasonal component shows a decent pattern with an acceptable shape.
Required codes are below:
ts12=ts(bik_2$sold_count, frequency = 15)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=15")
The reason why the ACF values are higher at frequency 15 might be that in Turkey most people get their salary on either day 1 or day 15 of the month. (Source) This could produce low ACF values, and at lag 1.0, which corresponds to day 15 of the time series, there is a significant autocorrelation. People probably tend to shop more after getting their salary; even if people generally use credit cards, receiving fresh money gives them the confidence to spend more. However, as mentioned above, the ACF of the detrended series is worse now. Moreover, the graph below shows the plot of the detrended values, which is worse than the one at frequency = 7.
The detrended (random) values are less stationary in this case: they do not have constant variance, although their mean is constant compared to the previous decomposition. The trend is less fluctuating here but still moves up and down over time. The seasonal component looks similar. Overall, this model is worse than the previous one because its detrended random values show more nonstationarity.
For the further parts, we will continue with the model of frequency = 7.
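The multiplicative logic used above can be sanity-checked on a small simulated weekly series (an assumption for illustration, not the Trendyol data): `decompose()` should recover the original values as the product seasonal × trend × random wherever the moving-average trend is defined.

```r
# Simulated series (not the product data): trend * weekly seasonal * noise.
set.seed(1)
days <- 1:105
sim <- ts((50 + 0.3 * days) *
            rep(c(1.2, 1.1, 1.0, 0.9, 0.8, 0.9, 1.1), 15) *
            exp(rnorm(105, sd = 0.05)),
          frequency = 7)
dec   <- decompose(sim, type = "multiplicative")
recon <- dec$seasonal * dec$trend * dec$random  # multiplicative reconstruction
# trend (and hence random) is NA at the edges of the centred moving average
max(abs(recon - sim), na.rm = TRUE)  # effectively zero
```

This is why, after modeling `decomposed11$random`, the fitted values are multiplied back by the seasonal and trend components before being compared with `ts11`.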
Based on the findings in the first part, ARIMA models will be proposed and their performance will be discussed.
There is no clear sign of which model to use, so AR(2), MA(1) and ARIMA(2,0,1) will be examined.
AR(2) is created below:
model11=arima(decomposed11$random,order=c(2,0,0))
print(model11)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 0))
##
## Coefficients:
## ar1 ar2 intercept
## 0.3203 -0.3775 0.9892
## s.e. 0.0948 0.0940 0.0280
##
## sigma^2 estimated as 0.08229: log likelihood = -16.35, aic = 40.7
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")
Fitted values (blue) are close to the real sales amounts (black). However, this is a simple model with few data points, and it does not fit a large share of them. The fitted values lag slightly behind the actuals, which is in the nature of AR models.
Residuals show a constant mean, but the variance varies at times. The ACF and PACF plots still show some signs of autocorrelation above the significance levels at lag 7.
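Such visual residual checks can be complemented by a formal Ljung-Box test; a minimal sketch on a simulated AR(2) series (an assumption, not the product data):

```r
# Toy AR(2) series with coefficients similar in spirit to model11.
set.seed(2)
x   <- arima.sim(model = list(ar = c(0.32, -0.38)), n = 120)
fit <- arima(x, order = c(2, 0, 0))
# Ljung-Box test on the residuals, jointly over the first 7 lags.
bt  <- Box.test(residuals(fit), lag = 7, type = "Ljung-Box")
bt  # a large p-value would indicate no remaining autocorrelation
```

Applied to `residuals(model11)`, a small p-value would confirm the lag-7 autocorrelation seen in the plots.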
MA(1) is created below:
model12=arima(decomposed11$random,order=c(0,0,1))
print(model12)
##
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 1))
##
## Coefficients:
## ma1 intercept
## 0.3650 0.9880
## s.e. 0.0936 0.0423
##
## sigma^2 estimated as 0.09173: log likelihood = -21.4, aic = 48.79
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")
Fitted values (blue) are again quite close to the real sales amounts (black) in this model, and they are lagged due to the nature of MA models.
Residuals show a fluctuating mean, and the variance does not look very good in this model. However, the ACF and PACF values are better here.
ARIMA(2,0,1) is created below:
model13=arima(decomposed11$random,order =c(2,0,1))
print(model13)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1))
##
## Coefficients:
## ar1 ar2 ma1 intercept
## 0.9430 -0.4970 -0.9058 0.9906
## s.e. 0.0924 0.0895 0.0542 0.0051
##
## sigma^2 estimated as 0.06702: log likelihood = -7.22, aic = 24.44
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")
Fitted values (blue) are again very close to the real sales amounts (black). The model fits better when the variance is low; outlier points reduce the quality of the fit.
plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")
tsdisplay(fitted13,main="ARIMA(2,0,1)")
Residuals show a constant mean, but the variance is high and fluctuating.
AIC(model11)
## [1] 40.69692
BIC(model11)
## [1] 50.91243
AIC(model12)
## [1] 48.79104
BIC(model12)
## [1] 56.45267
AIC(model13)
## [1] 24.44036
BIC(model13)
## [1] 37.20975
As seen above, model13, the ARIMA(2,0,1) model, is the best one, with the lowest AIC and BIC values. ARIMA(2,0,1) is therefore selected for the further steps.
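The manual comparison above generalizes naturally to a loop over candidate orders; a sketch on a simulated ARMA(2,1) series (an assumption, not the actual random component):

```r
# Toy ARMA(2,1) series, roughly mimicking the shape of model13's coefficients.
set.seed(3)
x <- arima.sim(model = list(ar = c(0.9, -0.5), ma = -0.9), n = 120)
# Candidate (p, d, q) orders, matching the three models tried above.
orders <- list(c(2, 0, 0), c(0, 0, 1), c(2, 0, 1))
aics   <- sapply(orders, function(o) AIC(arima(x, order = o)))
orders[[which.min(aics)]]  # candidate order with the lowest AIC
```

The same idea (automated order search, additionally penalized and with differencing tests) is what `auto.arima()` from the forecast package implements.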
Test for fitted values:
accu=function(actual,forecast){
n=length(actual)
error=actual-forecast
mean=mean(actual)
sd=sd(actual)
CV=sd/mean
MAPE=sum(abs(error/actual))/n
MAD=sum(abs(error))/n
MADP=sum(abs(error))/sum(abs(actual))
WMAPE=MAD/mean
l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
return(l)
}
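On a toy example (hypothetical numbers, not the product data) these formulas are easy to verify by hand; note also that MADP and WMAPE coincide whenever all actual values are positive, since both divide the total absolute error by the total of the actuals — which is why the two columns are identical in the tables below.

```r
actual   <- c(10, 20, 30, 40)
forecast <- c(12, 18, 33, 37)
error <- actual - forecast                    # -2, 2, -3, 3
MAPE  <- mean(abs(error / actual))            # (0.2+0.1+0.1+0.075)/4 = 0.11875
MAD   <- mean(abs(error))                     # 10/4 = 2.5
MADP  <- sum(abs(error)) / sum(abs(actual))   # 10/100 = 0.1
WMAPE <- MAD / mean(actual)                   # 2.5/25 = 0.1, equal to MADP here
```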
accu(ts11,tail(head(fitted_transformed11,31),28))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.1186466 2.884616 0.08794029 0.08794029
accu(ts11,tail(head(fitted_transformed12,31),28))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.1247375 2.98147 0.09089297 0.09089297
accu(ts11,tail(head(fitted_transformed13,31),28))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.1132647 2.620198 0.07987925 0.07987925
Again, the third result, ARIMA(2,0,1), gives the best performance, with the lowest errors on the MAPE, MAD, MADP and WMAPE metrics.
In this step, possible regressors will be examined for use in an ARIMAX model.
The code below computes the correlation between sold count and the other data categories.
correlation=function(dt){ # Function to compute correlation between sold counts and other columns in the data
dt=as.data.frame(dt)
a=data.frame(Variable=rep(NA,8),Correlation=rep(NA,8)) # pre-allocate 8 rows with named columns
a[1,1]="visit count"
a[2,1]="basket count"
a[3,1]="favored count"
a[4,1]="category sold"
a[5,1]="category visit"
a[6,1]="category basket"
a[7,1]="category favored"
a[8,1]="category brand sold"
for(i in 1:8) {
a[i,2]=cor(dt[,4],dt[,i+4])
}
return (a)
}
correlation(bik_2)
## Variable Correlation
## 1 visit count 0.7099228
## 2 basket count 0.7960412
## 3 favored count 0.4806033
## 4 category sold 0.6833365
## 5 category visit 0.2349762
## 6 category basket 0.4955038
## 7 category favored 0.4896690
## 8 category brand sold 0.5047960
cor(bik_2$sold_count,bik_2$price)
## [1] -0.05910234
Obviously, there is high correlation between sold count and basket count, visit count and favored count. This result actually makes sense: people tend to buy what they visit more, and when a category attracts more attention, the sales of all its products increase as well. Thanks to the function created above, candidate regressors are selected easily. However, a potential problem with these regressors is that their future values are not available, so they would have to be predicted as well in the final version of the model.
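The correlations computed above are contemporaneous only; `ccf()` can additionally reveal whether a regressor leads sales by some days, which would make it usable without forecasting it. A sketch on simulated data (an assumed lead/lag structure, not the real columns):

```r
# Toy data: "sales" respond to "basket" adds with a one-day delay.
set.seed(4)
basket <- as.numeric(arima.sim(model = list(ar = 0.5), n = 100))
sales  <- c(0, head(basket, -1)) + rnorm(100, sd = 0.2)
# Cross-correlation of basket[t+k] with sales[t] over k = -7..7.
cc <- ccf(basket, sales, lag.max = 7, plot = FALSE)
cc$lag[which.max(abs(cc$acf))]  # lag at which the cross-correlation peaks
```

A peak at a negative lag would mean the regressor leads sales, so yesterday's value could be used as a known regressor for today's forecast.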
In this step, the potential regressors will be used in the model and ARIMAX models will be created.
modelx1=arima(decomposed11$random,order=c(2,0,1),xreg=bik_2$basket_count)
print(modelx1)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = bik_2$basket_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept bik_2$basket_count
## 0.771 -0.4691 -0.6327 0.7984 0.001
## s.e. NaN 0.0375 NaN NaN NaN
##
## sigma^2 estimated as 0.06242: log likelihood = -3.35, aic = 18.69
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend
modelx2=arima(decomposed11$random,order=c(2,0,1),xreg=bik_2$visit_count)
print(modelx2)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = bik_2$visit_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept bik_2$visit_count
## 0.8744 -0.505 -0.8680 0.9139 0
## s.e. 0.0942 0.089 0.0636 0.0005 NaN
##
## sigma^2 estimated as 0.06188: log likelihood = -3.36, aic = 18.71
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend
modelx3=arima(decomposed11$random,order=c(2,0,1),xreg=bik_2$favored_count)
print(modelx3)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = bik_2$favored_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept bik_2$favored_count
## 0.8885 -0.5054 -0.8556 0.9579 1e-04
## s.e. 0.0922 0.0886 0.0557 NaN NaN
##
## sigma^2 estimated as 0.06526: log likelihood = -5.83, aic = 23.65
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
Three models are constructed, each with one of the three candidate regressors. They will now be tested on the last one week.
accu(ts11,tail(head(fitted_transformed13,31),7)) # arima model
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.04803259 0.6914002 0.021078 0.021078
accu(ts11,tail(head(fitted_transformedx1,31),7)) # arimax1
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.03311664 0.5477929 0.01669999 0.01669999
accu(ts11,tail(head(fitted_transformedx2,31),7)) # arimax2
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.04203058 0.6142789 0.01872689 0.01872689
accu(ts11,tail(head(fitted_transformedx3,31),7)) # arimax3
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.04493964 0.6470274 0.01972526 0.01972526
AIC(model13)
## [1] 24.44036
AIC(modelx1)
## [1] 18.69395
AIC(modelx2)
## [1] 18.71035
AIC(modelx3)
## [1] 23.65499
As seen in the results, all models perform similarly, but arimax1 is slightly better, with lower WMAPE and other error metrics. Moreover, the ARIMAX model with basket count as regressor (arimax1) outperforms the plain ARIMA model from the previous step. (Note that some coefficient standard errors of the ARIMAX fits are reported as NaN, which hints at estimation instability, so the individual coefficient estimates should be read with caution.)
After all the tests and operations done above, the arimax1 model gives the best results, and it is selected as the final model.
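To produce actual forecasts with the selected ARIMAX model, future basket counts are needed, and as noted earlier they are unknown. One way forward, sketched here on simulated series with hypothetical names (not the Trendyol data), is to forecast the regressor with its own ARIMA model and feed that forecast into `predict()` via `newxreg`:

```r
# Toy regressor and toy sales that depend on it.
set.seed(5)
basket <- arima.sim(model = list(ar = 0.6), n = 100) + 10
sales  <- 2 * basket + arima.sim(model = list(ar = 0.3), n = 100)
fit_x  <- arima(basket, order = c(1, 0, 0))           # model for the regressor itself
fut_x  <- predict(fit_x, n.ahead = 7)$pred            # its 7-day forecast
fit_y  <- arima(sales, order = c(1, 0, 0), xreg = basket)  # ARIMAX-style fit
fc     <- predict(fit_y, n.ahead = 7, newxreg = fut_x)
fc$pred  # 7 sales forecasts; fc$se holds their standard errors
```

The extra uncertainty from forecasting the regressor is not reflected in `fc$se`, so real prediction intervals would be somewhat wider than reported.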